<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <link href="https://raw.githubusercontent.com/Andres6936/Tweak/master/Docs/css/bootstrap.min.css" rel="stylesheet" type="text/css">
    <link href="css/bootstrap.min.css" rel="stylesheet" type="text/css">

    <title>An Efficient Data Structure For A Hex Editor</title>
</head>
<body class="col-8 col-xxl-5 py-5 mx-auto">
<h1 class="display-6 fw-bold">An Efficient Data Structure For A Hex Editor</h1>
<p class="text-muted text-end">
    by <a href="http://pobox.com/~anakin/">Simon Tatham</a>
</p>
<h2 class="my-5"><a name="S1"></a>1. Introduction</h2>
<p>
    Hex editors have been around for a long time, and at the very basic level they are very simple
    to write. Since they are mostly used for editing files such as executables, which contain a lot
    of cross-references to particular byte positions in the file, a hex editor need not have an
    insert mode in order to be useful. And a hex editor without an insert mode is very easy to
    implement: you simply allocate a large enough array for the input file, and use that as your
    data structure. The only operation you really need to be able to do efficiently is to jump to a
    particular byte position, and that's precisely what an array makes easy.
</p>
<p>
    On the other hand, an insert mode can be useful in other circumstances. Not <em>all</em> types
    of file you might want to edit have the same restrictions as an executable. And as soon as you
    want your hex editor to have an insert mode, the data structure question becomes much more
    interesting.
</p>
<p>
    In this article I present an efficient and scalable data structure which supports all the
    operations needed by a hex editor.
</p>
<h2 class="my-5"><a name="S2"></a>2. Simple options</h2>
<p>
    One technique used to support insert mode in editors is to use an array larger than the file
    size, with a gap in it. The file contents up to the current cursor position are stored at the
    start of the array; the file contents from the current cursor position to the end are stored at
    the end of the array; and the gap in the middle moves about as the cursor does.
</p>
<p>
    This makes insertion easy. When the user inserts an extra character, you just add it to one end
    or other of the gap. On the other hand, <em>moving</em> through the file now becomes a slow
    operation; it's not noticeable when you're moving by a byte, by a line, or even by a screenful
    at a time, but as soon as you try to jump to the start or end of the file, or jump to a
    particular specified file offset, suddenly the editor has to bodily shift enormous amounts of
    file data from one end of the gap to the other.
</p>
<p>
    Another slightly better option is to use a linked list of small arrays, and to let the arrays
    vary in size between K and 2K bytes, for some fixed minimum block size K. Inserting a single
    byte in the middle of a block doesn't cost too much; occasionally the block will grow beyond
    size 2K and have to be split into two smaller ones, but even that isn't too slow.
</p>
<p>
    Jumping to a particular position, however, is still an O(N) operation using this structure. In
    practice it isn't <em>too</em> bad, since the length of the linked list is at worst 1/K times
    the size of the file; but once the file size becomes seriously big, this approach does not scale
    well.
</p>
<p>
    The common problem in both these methods is that as soon as you make insertion a constant-time
    operation, seeking to a given byte position becomes linear-time. Whereas in the original array
    format, of course, seeking was constant-time but <em>insertion</em> became linear-time.
</p>
<h2 class="my-5"><a name="S3"></a>3. Using balanced trees</h2>
<p>
    This is where trees come in. Balanced tree structures (any of AVL trees, red-black trees and
    B-trees) all solve this sort of problem for sorted lists. You can insert an element into a
    balanced tree in <em>log</em> time, and you can search for a particular element in log time as
    well. This sounds like the kind of compromise we want: if making insertion constant-time forces
    seeking to be linear and vice versa, we would prefer to arrange for <em>both</em> to be
    log-time.
</p>
<p>
    The conventional use of a balanced tree to store a sorted list, however, is not immediately
    helpful to us. The only criterion we could reasonably sort on would be byte position in the
    file; and as soon as we store our data as a set of (position, data) pairs, we're back to
    insertion being linear again, because we would have to alter the position field of every tree
    element after the insertion point.
</p>
<p>
    Is there anything we can do to our balanced trees to make this work better?
</p>
<h2 class="my-5"><a name="S4"></a>4. Counted trees</h2>
<p>
    Yes, there is.
</p>
<p>
    Suppose you add an additional field to every node of a balanced tree. In that field, you store a
    count of the number of elements <em>in or below</em> that node.
</p>
<p>
    Operations which alter the tree (insertion and deletion) now have to make sure these counts
    remain accurate. This can be done without sacrificing the log-time characteristics of the
    operations. For example, when you add an element, you increment the count of the node containing
    it, and then work back up the tree to the root incrementing the counts in all the nodes you go
    past. Since the height of the tree is O(log N), this only takes you O(log N) time.
</p>
<p>
    So we can add counts to a tree and still maintain it efficiently. What have the counts bought
    us?
</p>
<p>
    Once we have counts in a tree, they introduce an entirely new way to <em>search</em> the tree.
    Starting at the root, we can search down the tree by examining the count fields rather than
    comparing elements as usual; and this allows us to find the Nth item in the tree, for any N, in
    a single log-time search. For example, suppose the root tree node contains a child with count
    54, then an actual element, then a child with count 73. Then:
</p>
<ul>
    <li>
        If you are trying to get to a position less than 54, then you descend straight to the
        leftmost child.
    </li>
    <li>
        If you are trying to get to <em>exactly</em> position 54, you return the element out of the
        root node.
    </li>
    <li>
        If you are trying to get to position 55 or greater, you descend to the rightmost child, and
        subtract 55 from your desired position. (If you want element 57 of the tree, then you know
        there are 55 elements in the tree before the right-hand subtree, so you know you want
        element 2 of the right-hand subtree.)
    </li>
</ul>
<p>
    So now we have a means of finding the Nth item in a tree in a log-time search. This is starting
    to look promising.
</p>
<p>
    The trouble is, we're still stuck with having some sort of sorting order on the tree. Now we
    need to deal with that.
</p>
<h2 class="my-5"><a name="S5"></a>5. Unsorted trees</h2>
<p>
    The simple answer to the sorting problem is to do away with sorting the tree at all!
</p>
<p>
    Conventional balanced trees have a sorting order because it's used to find elements in the tree,
    and to know where to add an element. But we don't need a sorting order to find things any more,
    because we can use a count-based search to jump to the Nth position. Can we also use counts
    during the tree add operation, to allow us to specify <em>where</em> we want to add our new
    element?
</p>
<p>
    We can. Tree add algorithms start by searching down the tree to find the position where the new
    element will be inserted. If we do this search using counts, in exactly the same way described
    in <a href="#S4">section 4</a>, then we can add any element we like at any position in the tree.
    Once we do this, of course, we have to throw out the sorting order completely, and never do
    another order-based search or insertion again, because they won't work. But that's OK, because
    we didn't need them anyway.
</p>
<p>
    Now we have exactly what we were after in the first place. We have a data structure which stores
    an unordered list of items, in such a way that we can insert or delete an item in log time <em>and</em>
    find the Nth element in log time.
</p>
<h2 class="my-5"><a name="S6"></a>6. Splitting and joining trees</h2>
<p>
    Now we can begin to get more ambitious. One issue we have not addressed yet is cut and paste.
</p>
<p>
    So far I have discussed tree insertion in the assumption that you only ever insert one character
    at a time into your tree. In fact hex editors need cut and paste just as much as normal text
    editors do; so we must think about how to insert or remove a larger block of data at a time.
</p>
<p>
    One obvious way is to process each byte individually. A ten-byte cut operation is ten individual
    deletions, and a ten-byte paste is ten individual insertions. This is fine if you only ever use
    cut and paste to move tiny chunks of data around a large file, but if you need to move <em>half
    the file</em> from one place to another, things get more interesting.
</p>
<p>
    The linked-list structure discussed in <a href="#S2">section 2</a> would have helped a lot with
    this problem. Linked lists don't just make it easy to insert or delete one item: they make it
    just as easy to unlink an enormous chunk of a list once you've found both ends of the chunk, and
    you can link that chunk in somewhere else easily as well.
</p>
<p>
    It turns out that you <em>can</em> do the same thing with balanced trees. At this point it
    starts to make a difference what kind of balanced tree you use: all three of AVL, red-black and
    B-trees support these operations, but the precise methods vary between them. I'm going to use
    B-trees from here on, because the algorithms are slightly simpler.
</p>
<p>
    What we need are two basic operations. Given a counted, unsorted B-tree containing an unordered
    list of items, we need to be able to:
</p>
<ul>
    <li>
        Split the tree down the middle, giving two valid B-trees as output.
    </li>
    <li>
        Take two valid B-trees and join them together end-to-end, giving one B-tree containing all
        the data from tree A followed by the data from tree B.
    </li>
</ul>
<p>
    This will provide all the operations we need. To unlink a large section from the middle of a
    tree, we split it in two places and then join the outer two parts back together; to link a large
    section <em>into</em> the middle of a tree, we split it at the insertion point, join the left
    half on to the left side of the inserted section, and join the right half on to the right side
    of the inserted section.
</p>
<h3 class="my-5"><a name="S6.1"></a>6.1. Joining two B-trees together</h3>
<p>
    When you add an element to a B-tree, sometimes it ends up increasing the size of a leaf node
    beyond the size limit. When that happens, you deal with it by splitting the node in two, and
    transforming the parent node so that where it previously had a single child pointer, it now has
    two child pointers with an element between them. If that makes the parent node too big as well,
    you do the same thing again, and so on until you reach the tree root.
</p>
<p>
    Joining two B-trees is therefore reasonably simple, <em>if</em> you have an additional
    separating element to place in between them. Position the two trees so that their leaf nodes are
    at the same level; now (usually) one tree will be shorter than the other. So you can add the
    root of the shorter tree as a sibling of the node next to it in the taller tree; their common
    parent gains one extra child pointer (pointing at the root of the shorter tree), separated from
    its neighbour by the additional separating element. If this causes the node to increase beyond
    the maximum size, just split it in two and propagate up to its parent, just as in the ordinary
    insertion process.
</p>
<p>
    If the trees were originally the same height, just combine their root nodes into a single larger
    root node. You need an extra element to go in between the rightmost child pointer of the
    left-hand root node, and the leftmost child pointer of the right-hand root node; and again, this
    is where your separating element comes in. Again, if the new root is too big to be a single
    node, split it in two and create a new root above it.
</p>
<p>
    So it turns out that it's very easy to join two trees together, but the algorithm requires a
    spare element to go in the middle. However, we normally don't have such a spare element: we just
    have two trees. This is easily solved, though: we simply start by removing the leftmost element
    of the right-hand tree using the ordinary tree deletion algorithm. Then we just do the join
    algorithm, as described above, using the element we just removed as our separator.
</p>
<h3 class="my-5"><a name="S6.2"></a>6.2. Splitting a B-tree in two</h3>
<p>
    To split a B-tree in two: we are given a tree, and a means of searching down the tree to find
    the split point. (In this application, that will be a numeric position, which we check against
    the node counts on the way down; in other situations, we might perfectly well want to split an
    ordinary <em>sorted</em> B-tree in half, so we might have an ordering-based search criterion. It
    makes no difference.)
</p>
<p>
    We start in the simplest possible way. Start at the root node; decide which of its subtree
    pointers you are going to descend down; and saw the node in half at that subtree pointer. The
    two half-nodes thus created will <em>each</em> need a subtree pointer to go on the cut end, but
    that's OK because we're about to saw the next node down in half as well and they can have half
    each. So descend to the next node, decide on a split point again, saw that node in half, and put
    a pointer to each half in the two halves of the parent node.
</p>
<p>
    Once we finish this searching-and-cutting pass, we will have successfully separated our tree
    into two parts at the required point. However, the result will almost certainly not be a pair of
    <em>valid</em> B-trees; the chances are that many of the nodes on the cut edges will be below
    the minimum allowed node size. In fact, if at any point our search criterion made us descend
    through the <em>endmost</em> subtree pointer in any node, some of those nodes will have no
    elements in them whatsoever, just a single subtree pointer!
</p>
<p>
    So now we must make a healing pass down the cut edge of each tree, to turn it back into a valid
    B-tree. We can start by throwing away the root node if it has nothing but a single subtree
    pointer (which will happen quite often if we split near one end of the original tree, since in
    that case the output trees will almost certainly need to be of different heights). Keep doing
    that until we find a real root node.
</p>
<p>
    One child of that node is on the cut edge, so it may be below the minimum size. If it is, we
    solve this using its (valid) neighbour node. If the neighbour is large, we can move some
    subtrees over into the undersized node to make two correctly sized nodes; if the neighbour is
    too small and does not have that many subtrees to spare, we can instead <em>combine</em> the
    undersized node with its neighbour. (And it turns out you can always do at least one of these:
    if the neighbour is too large to combine with the undersized node, then it <em>must</em> have
    enough subtrees for redistribution to give two viable nodes.)
</p>
<p>
    The only interesting case is that combining an undersized node with its neighbour reduces the
    number of subtrees of their common parent by one. Therefore:
</p>
<ul>
    <li>
        As we go down, we arrange for each node on the cut edge to be at least <em>one more
        than</em> minimum size, in case its size must drop by one when we process its child. (This
        still just about works in all cases.)
    </li>
    <li>
        If the first non-trivial root node had only two children (recall that the root node in a
        B-tree is the only node exempt from the minimum size limit), and those two children end up
        having to be combined, then the root node must be thrown away again and the combined node is
        the new root.
    </li>
</ul>
<p>
    Once we have sorted out each node, we descend to its child on the cut edge, and do the same
    thing again. Eventually we reach the bottom of the tree and every node is of valid size. Then we
    do the same thing to the cut edge of the other tree, and we're done.
</p>
<h2 class="my-5"><a name="S7"></a>7. Cloning trees</h2>
<p>
    The splitting and joining algorithms look as if they make cut and paste pretty much trivial. You
    can split a big chunk out of your editing buffer into a separate cut buffer easily enough; and
    then you can &#8216;paste&#8217; it somewhere else by joining it back into the middle of the
    editing buffer at a different position.
</p>
<p>
    However, in real life, cut and paste isn't that simple. People often want to paste the same data
    more than once; so you can't just link the cut buffer straight into the editing buffer, because
    then you don't still have it to link in again next time. You need to <em>copy</em> the cut
    buffer and link in the copy. Equally, users often want to press Copy rather than Cut, in which
    case you have to split the buffer tree in two places, <em>copy</em> the middle section, and join
    all three back together.
</p>
<p>
    Copying a tree, it would seem, is inherently an O(N) operation; there's no way you can copy a
    tree containing megabytes of data without actually copying all that data.
</p>
<p>
    Or is there?
</p>
<p>
    It turns out that we <em>can</em> do better than this, by adding another annotation field to
    each tree node. This time, the annotation is a <em>reference count</em>: it counts the number of
    pointers to the node, either from other tree nodes or from the &#8216;root&#8217; field in a
    tree header structure. To begin with, of course, all reference counts are 1.
</p>
<p>
    Reference counts are normally used for garbage collection. In this case, though, I'm going to
    use them to implement <em>copy-on-write</em>. All of the tree-altering algorithms (insertion and
    deletion, plus the split and join algorithms described above) will now check the reference count
    of a node before attempting to modify it. If they find that they need to modify a node with a
    reference count greater than one, they will not modify it. Instead, they will make a copy of
    that node, and use the copy in place of the original. The copy links to all the same child nodes
    as the original, so the reference count in each child must be incremented; and the copied node's
    parent (or tree header structure) now links to the copy rather than to the original, so the
    reference count in the original must be decremented. Now we are looking at a node with a
    reference count of 1, which means nobody else is using it so we can modify it safely.
</p>
<p>
    The effect of this is that it is now a trivial - not merely log-time but <em>constant</em>-time
    - operation to <em>clone</em> an entire B-tree, no matter how large. We simply create a new tree
    header structure; we point its root field at the root node of the input tree; and we increment
    the reference count on that root node.
</p>
<p>
    Once we have cloned a tree like this, we can treat the original and the clone as if they were
    entirely independent. If you add an element to one of them, for example, then a single string of
    nodes from the root down to one leaf will be duplicated and modified, but the rest of the trees
    will still be held in common. You can split either tree into lots of little pieces, or join it
    into the middle of a larger one, and never affect the data stored in what was once its clone,
    because every time you touch a node that the other tree is depending on, you make your own copy
    rather than disturbing it.
</p>
<p>
    This allows us to support <em>really</em> efficient cut and paste in our hex editor. You select
    a 200Mb chunk and press Copy; the buffer tree is split in two places (in log time), the middle
    section is cloned (instantly), and the tree is joined back together. You'd hardly know anything
    was different - but the cut buffer now contains a clone of <em>part</em> of the original buffer,
    most of which consists of nodes that are still shared with it. And you can paste in as many
    copies as you like of that chunk, still in no worse than O(log N) time. The best bit is that by
    the time you've done this a few times and have a file that's 1600Mb longer than it started out,
    the hex editor isn't actually using up 1600Mb more memory, because most of it is in shared
    nodes! This technique naturally provides a form of compression as well as being fast.
</p>
<h2 class="my-5"><a name="S8"></a>8. Lazy file loading</h2>
<p>
    In all of the above I have been tacitly assuming that the data elements stored in my tree are
    individual bytes. This would be hideously inefficient if I were using AVL or red-black trees, in
    which each node contains precisely one element: for every <em>byte</em> of the file being
    edited, there would be an overhead of two child pointers, a byte count and a reference count. On
    a normal 32-bit machine, that's 20 bytes per node, not counting overhead from the memory
    allocator. A factor of twenty is just ridiculous.
</p>
<p>
    B-trees are a bit more flexible, since they can be made to have a large minimum degree. A B-tree
    with a minimum node size of (say) 512 can contain up to 1023 bytes of data plus 1024 subtree
    pointers, and those 1023 bytes can be packed together in memory so the overhead is now more like
    a factor of five. Also, since no node in a B-tree ever changes its height above ground level,
    you can just not bother to allocate space for the 512 NULL child pointers in your leaf nodes,
    and since the vast majority of your nodes will <em>be</em> leaf nodes, the structure is now
    closer to being space-efficient.
</p>
<p>
    There are other improvements one could make. For example, there's no reason why a B-tree really
    needs to have the <em>same</em> minimum node degree at every level; so you could have low-degree
    nodes everywhere above the leaf level, and enormous leaf nodes containing 4-8Kb of file data.
    You could move to B+ trees in which no actual data elements were stored anywhere except in the
    leaf nodes, thus saving the tiny alignment overheads in the other nodes.
</p>
<p>
    However, there's a better direction to head in. In <a href="#S2">section 2</a> I mentioned the
    idea of using a linked list as the main data structure, and I said that each element of the
    linked list would be a smallish array of file bytes (between size K and 2K). There's no reason
    we couldn't do that in our B-tree-based approach: each element stored in the B-tree is no longer
    a single byte but a small block of bytes. It would mean that our element counts no longer
    allowed us to jump to the Nth byte, only to the Nth <em>block</em>; but we can fix that by
    replacing the element count with a byte count, summing the total <em>size</em> of all the blocks
    in or below a particular tree node. Now, given any byte position, we can do a single log-time
    search and return a data block plus an offset within that block.
</p>
<p>
    This technique adds work to all operations. Inserting a byte, for example, is now done by
    finding the block it needs to go into, inserting it in that block, and potentially splitting the
    block into two and doing an extra tree operation. Splitting and joining buffers involves
    splitting and joining blocks at each end, and checking to make sure undersized blocks are not
    created. So what does this technique buy us, that makes it worthwhile over just storing single
    bytes in the B-tree?
</p>
<p>
    The answer is: once we have a block data structure as our tree element, we can start having
    different <em>types</em> of block. In particular, we can have a type of block which is a
    placeholder, containing nothing but a file offset and length. A block of this type indicates
    &#8216;at this point in the tree we have N bytes from position P in the original file&#8217;.
    Blocks of this type are exempt from the normal maximum size for normal literal-data blocks.
</p>
<p>
    The effect of this is that we no longer need to read the entire file into memory when we start
    up. Instead, we just initialise our tree trivially, so that it contains nothing but a single
    placeholder block, with offset zero and size equal to the initial file size.
</p>
<p>
    Now whenever we need to read data from the tree, and it turns out the data in question is
    somewhere in a placeholder block, we must refer back to the original input file in order to find
    the data (and the placeholder block will tell us where in the file to read it from). So before
    we do any editing, our hex editor is suddenly a low-cost hex <em>file viewer</em>, which just
    pages back and forth and refers to the disk all the time.
</p>
<p>
    But as soon as we start altering parts of the file, the placeholder block gets broken up into
    smaller blocks, and literal-data blocks begin to be created in between them. If we cut and paste
    a section including a placeholder block, then the tree can end up containing placeholder blocks
    in a strange order; it might (for example) indicate something like &#8216;the first 192K of the
    input file; then the literal bytes 5B 49 A7; then 25K of the input file starting from position
    12345; then 512K of the input file starting from position 204325&#8217;.
</p>
<p>
    Now the hex editor <em>looks</em> as if it's doing exactly the same thing as it did to begin
    with. I can page around the original file; I can insert, delete, overwrite, cut, copy and paste
    to my heart's content, and (provided no other process modifies the original file under our feet)
    the data I am manipulating will remain consistent at all times with the editing operations I
    have performed. But there wasn't a big delay at startup when the file was loaded in, because
    most of it <em>wasn't</em> loaded in; and if I list the running processes on my system, the hex
    editor will not be using memory proportional to the size of the file. It will only be using
    memory proportional to the <em>changes</em> I've made to the file.
</p>
<p>
    When I save the file, if there are any placeholder blocks remaining in the buffer tree, the hex
    editor must write out the new version by referring to the original. This is the <em>only</em>
    remaining operation, apart from searching, that takes time proportional to the size of the file.
    And there are <em>no</em> remaining operations which take <em>memory</em> proportional to the
    size of the file.
</p>
<p>
    (There is one thing you need to be careful of. Literal data blocks must be permitted to fall
    below the minimum size K if there is no literal block next to them to merge with; in particular,
    this is vital if you are writing a binary file from scratch or you would never be able to give
    it a size between zero and K. But this raises the possibility that given a pathological sequence
    of editing operations, your data structure might end up being an interleaving of one-byte
    literal blocks and one-byte placeholder blocks, giving a huge space overhead. The simplest
    solution to this is to impose a minimum size of 2K on <em>placeholder</em> blocks, below which
    you read the relevant piece of file data and convert them into literal blocks; then they can be
    merged with adjacent blocks and the worst case is no longer terrible.)
</p>
<p>
    We now have a data structure which does pretty much everything you could reasonably ask a hex
    editor to be able to do, and does it all at a reasonable memory cost and (apart from the two
    genuinely necessary operations of searching and saving) <em>all</em> in O(log N) time.
</p>
<h2 class="my-5"><a name="S9"></a>9. Further directions</h2>
<p>
    The data structure as I have presented it is suitable for use in a high-performance hex editor
    with an insert mode.
</p>
<p>
    There are a couple more points worth noting.
</p>
<h3 class="my-5"><a name="S9.1"></a>9.1. Conventional text editing</h3>
<p>
    This structure would need only minor modifications to be an efficient basis for a conventional
    text editor. In order to do this, you would need to be able to jump quickly to a particular <em>line</em>
    of the file, which means you'd need a node annotation counting newlines.
</p>
<p>
    In fact, it's possible to do slightly better than that: we can devise a more complex node
    annotation which tracks the effect of an arbitrary byte sequence on the (line, column) position.
    Assuming that a physical tab character always advances the cursor to the next multiple of 8
    spaces, there are three possibilities:
</p>
<ul>
    <li>
        A sequence of bytes containing no newlines or tabs simply adds some number A to the column
        number, and does not affect the line number.
    </li>
    <li>
        A sequence of bytes containing no newlines but at least one tab has the overall effect of
        adding some number A to the column, and rounding it up to the next number that is congruent
        to B mod 8.
    </li>
    <li>
        A sequence of bytes containing at least one newline has the effect of adding some number A
        to the <em>line</em> number, and setting the column number to a fixed value B.
    </li>
</ul>
<p>
    These three function schemas are closed under composition (i.e. combining any two of them gives
    another one). Storing one in each node of a buffer tree would provide the ability to search
    directly to <em>a particular (line, column) position</em> in a single log-time search. So the
    text editor could treat its buffer as a simple sequence of bytes (or possibly of Unicode
    characters). This is superior to treating the buffer as a sequence of lines, because it removes
    the distinction between inserting <em>within</em> a line and inserting data <em>between</em>
    lines. In particular, cut and paste in a line-based model is fiddly because lines must be
    spliced together at each end of the pasted region; but cut and paste in this model is as trivial
    as it was in the hex editor - you just cut a sequence of bytes, paste it somewhere else, and the
    line/column indexing automatically keeps up no matter what you do.
</p>
<p>
    The only snag is that if you did this, you would probably no longer be able to do the trick with
    placeholder blocks and lazy file loading; a text editor tends to need to know in advance where
    all the newlines are in its buffer, so there would probably be no alternative to physically
    loading the file. But in that, at least, this data structure is no worse than any other.
</p>
<h3 class="my-5"><a name="S9.2"></a>9.2. Supporting undo</h3>
<p>
    An undo function in an editor <em>conceptually</em> stores a sequence of previous buffer states,
    and allows you to return to one of them when you need to.
</p>
<p>
    Usually, this is not actually implemented by storing copies of the entire buffer, since that
    would be ludicrously wasteful of space! Instead, a journal of changes is kept which allows
    previous buffer states to be <em>reconstructed</em> by reversing the precise changes made.
</p>
<p>
    One could do that using this data structure, if one wanted to. However, there's another
    intriguing option. Since cloning an arbitrarily large tree is a cheap operation, you could
    implement undo by <em>actually</em> storing a sequence of clones of previous buffer states! The
    cost of this would be nothing like as bad as it would na&#239;vely appear.
</p>
<p>
    It might still not be ideal, though. Every time you clone a tree and the two clones diverge,
    several nodes must be copied, and if each node contains several blocks of literal data then the
    cost of maintaining too many buffer clones might still become prohibitive. But it's an
    interesting possibility regardless.
</p>
<h2 class="my-5"><a name="S10"></a>10. Summary</h2>
<p>
    I've presented a design for a data structure which implements practically every operation
    required for a hex editor in O(log N) time, apart from one or two which genuinely <em>need</em>
    to be O(N).
</p>
<p>
    The structure is:
</p>
<ul>
    <li>
        A B-tree, each of whose elements is either a small array of literal data bytes, or a
        placeholder block denoting a section of the unmodified input file.
    </li>
    <li>
        Each B-tree node is annotated with the total byte count of all the elements in or below that
        node, allowing a log-time search to pinpoint any numeric byte position.
    </li>
    <li>
        Those counts provide the only necessary means of navigating the tree, so there is no need
        for a sorting criterion.
    </li>
    <li>
        Split and join algorithms make it possible to link and unlink large chunks from the middle
        of a buffer at a time.
    </li>
    <li>
        Reference counts implementing copy-on-write allow cloning of chunks in constant time.
    </li>
</ul>
<p>
    As a result:
</p>
<ul>
    <li>
        Inserting or deleting bytes in the file is a log-time operation.
    </li>
    <li>
        Finding a particular byte position is a log-time operation.
    </li>
    <li>
        Cut and paste is always log-time, no matter how large or complex the chunk of data being
        moved around.
    </li>
    <li>
        Memory usage grows proportionally to the <em>changes</em> made to the file, not the overall
        file size. (However, memory usage is also <em>bounded</em> by a value proportional to the
        file size, even if you keep editing and re-editing for ever.)
    </li>
</ul>
<p>
    Searching must still be linear (there's no alternative to actually reading the data if you need
    to know anything about its contents), and saving the modified output file is linear (because you
    actually must physically write out that much data), but <em>everything</em> else can be done in
    log time.
</p>
<p>
    I've also sketched a means of converting this into a data structure for an ordinary text editor,
    and suggested interesting implications in the area of undo operations.
</p>
<h2 class="my-5"><a name="S11"></a>11. References</h2>
<p>
    Donald Knuth's &#8216;The Art of Computer Programming&#8217; (<a
        href="http://en.wikipedia.org/w/wiki.phtml?title=Special:Booksources&amp;isbn=0201485419">Addison-Wesley,
    ISBN 0201485419</a>) presents at least some of the same ideas as this article. Counted and
    unsorted trees are mentioned in volume 3; splitting and joining are also described (although
    Knuth does them on AVL trees, which are significantly more fiddly to split than B-trees; you
    have to cut the tree into lots of little pieces, and then put them all back together by using
    the join algorithm repeatedly).
</p>
<p>
    &#8216;Tweak&#8217;, a hex editor implementing this data structure, can be downloaded at <a
        href="http://www.chiark.greenend.org.uk/~sgtatham/tweak/"><code>http://www.chiark.greenend.org.uk/~sgtatham/tweak/</code></a>.
</p>

<hr>
<address class="text-muted text-end">
    [tweak version 3.02]
</address>
<script src="https://github.com/Andres6936/Tweak/blob/master/Docs/js/bootstrap.bundle.min.js"></script>
<script src="js/bootstrap.bundle.min.js"></script>
</body>
</html>
